{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "try:\n",
    "    import evidently\n",
    "except:\n",
    "    get_ipython().system('pip install git+https://github.com/evidentlyai/evidently.git')"
   ],
   "outputs": [],
   "metadata": {
    "id": "u8uzrK81m71s",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.datasets import fetch_openml\n",
    "\n",
    "from evidently import ColumnMapping\n",
    "from evidently.report import Report\n",
    "from evidently.metric_preset import DataDriftPreset"
   ],
   "outputs": [],
   "metadata": {
    "id": "3sgfSUPzm-oZ",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data = fetch_openml(name='adult', version=2, as_frame='auto')\n",
    "df = data.frame\n",
    "df.head()"
   ],
   "outputs": [],
   "metadata": {
    "id": "CHmF2OsvnGbd",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's add two features to illustrate, that we choose stat test depending not just on its type, but also on a number of unique values.\n",
    "\n",
    "Also, we will keep in mind that these features are absolutely random, so we don't expect any drift here."
   ],
   "metadata": {
    "id": "mUbZ19AZnNy1",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "df['num_feature_with_3_values'] = np.random.choice(3, df.shape[0])\n",
    "df['num_feature_with_2_values'] = np.random.choice(2, df.shape[0])"
   ],
   "outputs": [],
   "metadata": {
    "id": "so4sWZsznOLp",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "numerical_features = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'num_feature_with_3_values', 'num_feature_with_2_values']\n",
    "categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'class']\n",
    "column_mapping = ColumnMapping(numerical_features=numerical_features, categorical_features=categorical_features)"
   ],
   "outputs": [],
   "metadata": {
    "id": "QXTlWEQAnQ9l",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## small dataset"
   ],
   "metadata": {
    "id": "lH8cIWmgnU-Q",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### no difference"
   ],
   "metadata": {
    "id": "GwPADZ-GnVDQ",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We created 2 small random samples, so we do not expect to see any drift here."
   ],
   "metadata": {
    "id": "fAcxHdn2nZ-P",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_report = Report(metrics=[\n",
    "    DataDriftPreset(),\n",
    "])\n",
    "\n",
    "data_drift_report.run(\n",
    "    reference_data=df.sample(1000, random_state=0), \n",
    "    current_data=df.sample(1000, random_state=10), \n",
    "    column_mapping=column_mapping\n",
    ")\n",
    "data_drift_report"
   ],
   "outputs": [],
   "metadata": {
    "id": "oe4x-l6EnZPB",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "When you're working with small datasets, it's more likely that you'll get different distributions by chance. But it can also be concluded that statistical tests are quite sensitive."
   ],
   "metadata": {
    "id": "sJb7V0ZTnknk",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### data shifted"
   ],
   "metadata": {
    "id": "6t3RWNwhnn9u",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We split data in 2 samples by relationship status, so we do expect to see some drift here."
   ],
   "metadata": {
    "id": "C1qX_m09nsBk",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_report = Report(metrics=[\n",
    "    DataDriftPreset(),\n",
    "])\n",
    "\n",
    "data_drift_report.run(\n",
    "    reference_data=df[df.relationship.isin(['Husband', 'Wife'])].sample(1000, random_state=0), \n",
    "    current_data=df[~df.relationship.isin(['Husband', 'Wife'])].sample(1000, random_state=10), \n",
    "    column_mapping=column_mapping\n",
    ")\n",
    "data_drift_report"
   ],
   "outputs": [],
   "metadata": {
    "id": "Ew35RtMQnfjp",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## big dataset"
   ],
   "metadata": {
    "id": "J4DTl121oc2m",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### no difference"
   ],
   "metadata": {
    "id": "_CpUZ2t1ogoQ",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We created 2 small random samples, so we do not expect to see any drift here."
   ],
   "metadata": {
    "id": "sBpFjZ9EohAH",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_report = Report(metrics=[\n",
    "    DataDriftPreset(),\n",
    "])\n",
    "\n",
    "data_drift_report.run(\n",
    "    reference_data=df.sample(30000, random_state=0), \n",
    "    current_data=df.sample(30000, random_state=10), \n",
    "    column_mapping=column_mapping\n",
    ")\n",
    "data_drift_report"
   ],
   "outputs": [],
   "metadata": {
    "id": "4XuKlbmAodY-",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### data shifted"
   ],
   "metadata": {
    "id": "-BuvVeo1ovtW",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We split data in 2 samples by relationship status, so we do expect to see some drift here."
   ],
   "metadata": {
    "id": "ZvlhPW2jozmr",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_report = Report(metrics=[\n",
    "    DataDriftPreset(),\n",
    "])\n",
    "data_drift_report.run(\n",
    "    reference_data=df[df.relationship.isin(['Husband', 'Wife'])].sample(30000, random_state=0, replace=True), \n",
    "    current_data=df[~df.relationship.isin(['Husband', 'Wife'])].sample(30000, random_state=10, replace=True), \n",
    "    column_mapping=column_mapping\n",
    ")\n",
    "data_drift_report"
   ],
   "outputs": [],
   "metadata": {
    "id": "W5YSxoPbo0Yl",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Support Evidently\n",
    "Enjoyed the tutorial? Star Evidently on GitHub to contribute back! This helps us continue creating free open-source tools for the community. https://github.com/evidentlyai/evidently"
   ],
   "metadata": {}
  }
 ]
}